OCR Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers

Internal identifier: 000F12 (Main/Exploration); previous: 000F11; next: 000F13

Authors: B. Allen [United States]; Andrea Japzon [United States]; Palakorn Achananuparp [United States]; Jung Lee [United States]

Source: Lecture Notes in Computer Science [0302-9743]; 2007

RBID : ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A

Abstract

Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.

DOI: 10.1007/978-3-540-73354-6_26

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author>
<name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
</author>
<author>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
</author>
<author>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
</author>
<author>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-73354-6_26</idno>
<idno type="url">https://api.istex.fr/document/339038902DE6BB0B9E4B4EE9271F38ADA614AC7A/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000928</idno>
<idno type="wicri:Area/Istex/Curation">000918</idno>
<idno type="wicri:Area/Istex/Checkpoint">000927</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Allen B:a:framework:for</idno>
<idno type="wicri:Area/Main/Merge">000F25</idno>
<idno type="wicri:Area/Main/Curation">000F12</idno>
<idno type="wicri:Area/Main/Exploration">000F12</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author>
<name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>College of Information Science and Technology, Drexel University Philadelphia</wicri:cityArea>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">États-Unis</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">339038902DE6BB0B9E4B4EE9271F38ADA614AC7A</idno>
<idno type="DOI">10.1007/978-3-540-73354-6_26</idno>
<idno type="ChapterID">26</idno>
<idno type="ChapterID">Chap26</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Pennsylvanie</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Pennsylvanie">
<name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
</region>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<name sortKey="Achananuparp, Palakorn" sort="Achananuparp, Palakorn" uniqKey="Achananuparp P" first="Palakorn" last="Achananuparp">Palakorn Achananuparp</name>
<name sortKey="Allen, B" sort="Allen, B" uniqKey="Allen B" first="B." last="Allen">B. Allen</name>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<name sortKey="Japzon, Andrea" sort="Japzon, Andrea" uniqKey="Japzon A" first="Andrea" last="Japzon">Andrea Japzon</name>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
<name sortKey="Lee, Jung" sort="Lee, Jung" uniqKey="Lee J" first="Jung" last="Lee">Jung Lee</name>
</country>
</tree>
</affiliations>
</record>
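
The record above can also be read programmatically. The following is a minimal sketch, not part of Dilib, that pulls the title, authors, DOI, and year out of such a record with Python's standard `xml.etree.ElementTree`. Because the record uses the `wicri:` prefix without declaring it, the sketch injects a placeholder namespace declaration before parsing; the URI `urn:example:wicri`, the function name `parse_record`, and the abridged `SAMPLE` string are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the record uses the "wicri:" prefix without
# declaring it, so a placeholder declaration is injected before parsing.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text: str) -> dict:
    """Extract title, authors, DOI, and year from a Wicri <record> string."""
    # Bind the undeclared prefix on the root element so the parser accepts it.
    patched = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    root = ET.fromstring(patched)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    pub_stmt = root.find("./TEI/teiHeader/fileDesc/publicationStmt")
    doi = pub_stmt.find("idno[@type='doi']")
    date = pub_stmt.find("date")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [n.text for n in title_stmt.findall("author/name")],
        "doi": doi.text if doi is not None else None,
        "year": date.get("year") if date is not None else None,
    }


# Abridged copy of the record above, kept small for illustration.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers</title>
<author>
<name sortKey="Allen, B" first="B." last="Allen">B. Allen</name>
</author>
<author>
<name sortKey="Japzon, Andrea" first="Andrea" last="Japzon">Andrea Japzon</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-73354-6_26</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

info = parse_record(SAMPLE)
print(info["title"])
print(info["authors"])
print(info["doi"], info["year"])
```

The same function should work on the full record, for example on the output of the `HfdSelect` commands shown in the next section, provided the text starts with a bare `<record>` tag as it does here.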

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F12 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F12 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A
   |texte=   A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers
}}
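
When many records need such a link, the template call can be generated rather than typed by hand. The sketch below is an illustration only: the helper name `explor_lien` and its argument names are hypothetical, and the parameter names written into the output are simply those visible in the template above. Column alignment is not reproduced, on the assumption that MediaWiki trims whitespace around named template parameters.

```python
def explor_lien(wiki, area, flux, etape, type_, cle, texte):
    """Return the wikitext of an {{Explor lien}} call for one record."""
    # Parameter names are copied from the template shown above.
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", type_),
        ("clé", cle),
        ("texte", texte),
    ]
    lines = ["{{Explor lien"]
    lines += [f"   |{name}= {value}" for name, value in fields]
    lines.append("}}")
    return "\n".join(lines)


link = explor_lien(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="Main",
    etape="Exploration",
    type_="RBID",
    cle="ISTEX:339038902DE6BB0B9E4B4EE9271F38ADA614AC7A",
    texte="A Framework for Text Processing and Supporting Access "
          "to Collections of Digitized Historical Newspapers",
)
print(link)
```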

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024